12 research outputs found

    Data center's telemetry reduction and prediction through modeling techniques

    Nowadays, Cloud Computing is widely used to host and deliver services over the Internet. The architecture of clouds is complex due to the heterogeneous nature of their hardware, and they are hosted in large-scale data centers. Managing such complex infrastructure effectively and efficiently requires constant monitoring. This monitoring generates large amounts of telemetry data streams (e.g., hardware utilization metrics), which are used for multiple purposes including problem detection, resource management, workload characterization, resource utilization prediction, capacity planning, and job scheduling. These telemetry streams consume costly bandwidth and storage space, particularly over the medium to long term in large data centers. Moreover, accurately estimating the future values of these streams is challenging due to multi-tenant co-hosted applications and dynamic workloads, and inaccurate estimation leads to either under- or over-provisioning of data center resources. In this Ph.D. thesis, we propose to improve prediction accuracy and reduce bandwidth utilization and storage space requirements with the help of modeling and prediction methods from machine learning. Most existing methods are based on a single model, which often does not appropriately estimate different workload scenarios. Moreover, these prediction methods use observation windows of fixed size, which cannot produce accurate results because they are not adaptively adjusted to capture the local trends in the recent data; estimation methods trained on fixed sliding windows therefore use a large number of irrelevant observations, which yields inaccurate estimations. In summary, we C1) efficiently reduce bandwidth and storage for telemetry data through real-time modeling using a Markov chain model; C2) propose a novel method to adaptively and automatically identify the most appropriate model to accurately estimate data center resource utilization; and C3) propose a deep learning-based adaptive window size selection method which dynamically limits the sliding window size to capture the local trend in the latest resource utilization for building the estimation model.

    Performance Characterization of Spark Workloads on Shared NUMA Systems

    As the adoption of Big Data technologies becomes the norm in an increasing number of scenarios, there is also a growing need to optimize them for modern processors. Spark has gained momentum over the last few years among companies looking for high-performance solutions that can scale out across different cluster sizes. At the same time, modern processors can be connected to large amounts of physical memory, in the range of up to a few terabytes. This opens an enormous range of opportunities for runtimes and applications that aim to improve their performance by leveraging the low latencies and high bandwidth provided by RAM. The result is that there are several examples today of applications that have started pushing the in-memory computing paradigm to accelerate tasks. To deliver such a large physical memory capacity, hardware vendors have leveraged Non-Uniform Memory Architectures (NUMA). This paper explores how Spark-based workloads are impacted by NUMA-placement decisions, how different Spark configurations change delivered performance, how application characteristics can be used to predict workload collocation conflicts, and how to improve performance by collocating workloads in scale-up nodes. We explore several workloads run on top of the IBM Power8 processor, and provide manual strategies that deliver performance improvements of up to 40% on Spark workloads when using smart processor-pinning and workload collocation strategies. This work is partially supported by the European Research Council (ERC) under the EU Horizon 2020 programme (GA 639595), the Spanish Ministry of Economy, Industry and Competitiveness (TIN2015-65316-P) and the Generalitat de Catalunya (2014-SGR-1051).
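    The paper's pinning strategies are manual and platform-specific; below is only a minimal Python sketch of the core idea on Linux, assuming a hypothetical core-to-NUMA-node mapping (on a real Power8 machine the topology would be read from the OS, e.g. /sys/devices/system/node/):

```python
import os

# Hypothetical mapping of NUMA nodes to core IDs; on a real system this
# can be read from /sys/devices/system/node/node*/cpulist.
NUMA_NODE_CORES = {
    0: {0, 1, 2, 3, 4, 5, 6, 7},
    1: {8, 9, 10, 11, 12, 13, 14, 15},
}

def pin_to_numa_node(node: int) -> None:
    """Restrict the current process (e.g., an executor launcher) to the
    cores of a single NUMA node, avoiding remote memory accesses.
    Linux-only: sched_setaffinity is not available on other platforms."""
    os.sched_setaffinity(0, NUMA_NODE_CORES[node])

if __name__ == "__main__":
    pin_to_numa_node(0)
    print("Running on cores:", sorted(os.sched_getaffinity(0)))
```

    Collocated workloads would each be pinned to a different node, so that they do not contend for the same memory controller.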

    Adaptive sliding windows for improved estimation of data center resource utilization

    Accurate prediction of data center resource utilization is required for capacity planning, job scheduling, energy saving, workload placement, and load balancing to utilize resources efficiently. However, accurately predicting those resources is challenging due to dynamic workloads, heterogeneous infrastructures, and multi-tenant co-hosted applications. Existing prediction methods use fixed-size observation windows, which cannot produce accurate results because they are not adaptively adjusted to capture local trends in the most recent data. Consequently, those methods either train on large fixed sliding windows containing many irrelevant observations, yielding inaccurate estimations, or suffer from degraded estimations when short windows face quickly changing trends. In this paper we propose a deep learning-based adaptive window size selection method, dynamically limiting the sliding window size to capture the trend in the latest resource utilization, and then build an estimation model for each trend period. We evaluate the proposed method against multiple baseline and state-of-the-art methods, using real data-center workload datasets. The experimental evaluation shows that the proposed solution outperforms those state-of-the-art approaches and yields 16 to 54% improved prediction accuracy compared to the baseline methods. This work is partially supported by the European Research Council (ERC) under the EU Horizon 2020 programme (GA 639595), the Spanish Ministry of Economy, Industry and Competitiveness (TIN2015-65316-P and IJCI2016-27485), the Generalitat de Catalunya, Spain (2014-SGR-1051), and the University of the Punjab, Pakistan. The statements made herein are solely the responsibility of the authors.
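    The paper's window selector is deep learning-based; the sketch below substitutes a simple mean-drift heuristic to illustrate the adaptive-window idea. The thresholds and the one-step linear predictor are illustrative assumptions, not the authors' method:

```python
import numpy as np

def adaptive_window(series: np.ndarray, max_window: int = 100,
                    min_window: int = 10, tol: float = 0.05) -> int:
    """Return a window length covering only the latest local trend: grow
    the window backwards from the newest sample and stop once the oldest
    chunk's mean drifts away from the mean of the most recent chunk."""
    recent = series[-min_window:]
    window = min_window
    while window < min(max_window, len(series)):
        candidate = series[-(window + min_window):]
        older = candidate[:min_window]
        if abs(older.mean() - recent.mean()) > tol * (abs(recent.mean()) + 1e-9):
            break  # trend change detected; keep the shorter window
        window += min_window
    return window

def predict_next(series: np.ndarray) -> float:
    """Fit a line on the adaptive window and extrapolate one step ahead."""
    w = adaptive_window(series)
    window = series[-w:]
    slope, intercept = np.polyfit(np.arange(w), window, 1)
    return slope * w + intercept

if __name__ == "__main__":
    # Synthetic utilization: a flat period followed by a rising trend.
    cpu = np.concatenate([np.full(60, 30.0), np.linspace(30, 80, 40)])
    cpu += np.random.default_rng(0).normal(0, 1, cpu.size)
    print(f"next CPU utilization estimate: {predict_next(cpu):.1f}%")
```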

    Adaptive prediction models for data center resources utilization estimation

    Accurate estimation of data center resource utilization is a challenging task due to multi-tenant co-hosted applications with dynamic and time-varying workloads. Accurate estimation of future resource utilization helps in better job scheduling, workload placement, capacity planning, proactive auto-scaling, and load balancing, while inaccurate estimation leads to either under- or over-provisioning of data center resources. Most existing estimation methods are based on a single model that often does not appropriately estimate different workload scenarios. To address these problems, we propose a novel method to adaptively and automatically identify the most appropriate model to accurately estimate data center resource utilization. The proposed approach trains a classifier based on statistical features of historical resource usage to decide which prediction model to use for the resource utilization observations collected during a specific time interval. We evaluated our approach on real datasets and compared the results with multiple baseline methods. The experimental evaluation shows that the proposed approach outperforms the state-of-the-art approaches and delivers 6% to 27% improved resource utilization estimation accuracy compared to baseline methods. This work is partially supported by the European Research Council (ERC) under the EU Horizon 2020 programme (GA 639595), the Spanish Ministry of Economy, Industry and Competitiveness (TIN2015-65316-P and IJCI2016-27485), the Generalitat de Catalunya (2014-SGR-1051), NPRP grant # NPRP9-224-1-049 from the Qatar National Research Fund (a member of Qatar Foundation), and the University of the Punjab, Pakistan.
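    A minimal sketch of the classifier-chooses-model idea, assuming two hypothetical candidate predictors and a small illustrative feature set; the paper's actual features, candidate models, and classifier may differ:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def features(window: np.ndarray) -> list:
    """Illustrative statistical features of a resource-usage window."""
    return [window.mean(), window.std(), window.max() - window.min()]

def predict_last_value(window):    # candidate model 0: naive persistence
    return window[-1]

def predict_linear_trend(window):  # candidate model 1: linear extrapolation
    slope, intercept = np.polyfit(np.arange(len(window)), window, 1)
    return slope * len(window) + intercept

MODELS = [predict_last_value, predict_linear_trend]

def best_model_label(window, target):
    """Training label: index of the model with the smallest error here."""
    errors = [abs(m(window[:-1]) - target) for m in MODELS]
    return int(np.argmin(errors))

# Train the selector on historical windows, then apply it to new data.
rng = np.random.default_rng(1)
history = rng.normal(50, 5, 1000).cumsum() / 20 + 40  # synthetic utilization
windows = [history[i:i + 20] for i in range(0, 960, 10)]
X = [features(w[:-1]) for w in windows]
y = [best_model_label(w, w[-1]) for w in windows]
selector = DecisionTreeClassifier(max_depth=3).fit(X, y)

new_window = history[-20:]
chosen = MODELS[selector.predict([features(new_window)])[0]]
print("estimate:", chosen(new_window))
```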

    Prioritization of Exigency Services in Multi-Agent Transportation Systems

    A multi-agent system helps achieve a single global goal by working on different tasks in a distributed environment. This research presents a new framework for handling the prioritization of exigency services in an urban transportation system. Prioritization of exigency services must be handled for vehicles such as ambulances, police mobiles, fire brigades, bomb disposal squads, search and rescue vehicles, turntable ladders, and other similar vehicles. In the proposed framework, a single ARTIS agent with four In-agents is deployed at each signal node. In-agents at different nodes share information about the exigency with the ARTIS agent, which analyses the input data and determines the types of conflict that might cause delays in the provision of emergency services. Signals are operated according to the priorities of the approaching exigency vehicles; when vehicles have the same priority, lane congestion and vehicle wait time are used to operate the signal. We demonstrate the application of our proposed approach using different cases of a traffic case study.
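    A minimal sketch of the conflict-resolution rule just described, with hypothetical vehicle classes and priority values; the tie-breaking order follows the abstract (priority first, then lane congestion and wait time):

```python
from dataclasses import dataclass

# Hypothetical priority ranks; lower value = more urgent.
PRIORITY = {"ambulance": 0, "fire_brigade": 1, "police": 2, "rescue": 3}

@dataclass
class ExigencyRequest:
    vehicle: str          # e.g., "ambulance"
    lane_congestion: int  # queued vehicles in the approach lane
    wait_time: float      # seconds the vehicle has been waiting

def signal_order(requests):
    """Order competing requests: priority first; ties broken by higher
    lane congestion, then by longer wait time."""
    return sorted(requests, key=lambda r: (PRIORITY[r.vehicle],
                                           -r.lane_congestion,
                                           -r.wait_time))

requests = [ExigencyRequest("police", 12, 40.0),
            ExigencyRequest("ambulance", 3, 10.0),
            ExigencyRequest("ambulance", 8, 25.0)]
for r in signal_order(requests):
    print(r.vehicle, r.lane_congestion, r.wait_time)
```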

    Real-time data center's telemetry reduction and reconstruction using Markov chain models

    Large-scale data centers are composed of thousands of servers organized in interconnected racks to offer services to users. These data centers continuously generate large amounts of telemetry data streams (e.g., hardware utilization metrics) used for multiple purposes, including resource management, workload characterization, resource utilization prediction, capacity planning, and real-time analytics. These telemetry streams consume costly bandwidth and storage space, particularly over the medium to long term in large data centers. This paper addresses this problem by proposing and evaluating a system that efficiently reduces bandwidth and storage for telemetry data through real-time modeling using Markov chain-based methods. Our proposed solution was evaluated using real telemetry datasets and compared with polynomial regression methods for reducing and reconstructing data. Experimental results show that data can be lossily compressed up to 75% for bandwidth utilization and 95.33% for storage space, with reconstruction accuracy close to 92%. This work was supported in part by the European Research Council (ERC) under the EU Horizon 2020 programme under Grant GA 639595, in part by the Spanish Ministry of Economy, Industry and Competitiveness under Grant TIN2015-65316-P and Grant IJCI2016-27485, in part by the Generalitat de Catalunya under Grant 2014-SGR-1051, in part by the University of the Punjab, Pakistan, and in part by the Qatar National Research Fund (a member of Qatar Foundation) under NPRP Grant # NPRP9-224-1-049.
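    A simplified sketch of the general approach: quantize a utilization stream into discrete states, keep only the transition statistics as the compact model, and rebuild an approximate stream from them. The bin width and the deterministic most-likely-transition reconstruction are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

N_STATES = 10  # quantize 0-100% utilization into 10% bins (assumption)

def to_states(stream: np.ndarray) -> np.ndarray:
    return np.clip((stream / (100 / N_STATES)).astype(int), 0, N_STATES - 1)

def transition_matrix(states: np.ndarray) -> np.ndarray:
    """Row-normalized transition counts: the compact model that is
    transmitted/stored instead of the raw telemetry stream."""
    counts = np.zeros((N_STATES, N_STATES))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

def reconstruct(matrix: np.ndarray, start: int, length: int) -> np.ndarray:
    """Rebuild an approximate stream by following the most likely
    transitions, mapping each state back to its bin midpoint."""
    states = [start]
    for _ in range(length - 1):
        states.append(int(matrix[states[-1]].argmax()))
    return (np.array(states) + 0.5) * (100 / N_STATES)

cpu = np.abs(np.random.default_rng(2).normal(40, 10, 500)).clip(0, 100)
states = to_states(cpu)
model = transition_matrix(states)  # ~100 numbers instead of 500 samples
approx = reconstruct(model, states[0], len(cpu))
print("mean absolute error:", np.abs(approx - cpu).mean())
```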